Abstract

We propose and evaluate a class of objective functions that rank
hypotheses for feature labels. Our approach takes into account the
representation cost and quality of the shapes themselves, and
balances the geometric requirements against the photometric
evidence. This balance is essential for any system using
underconstrained or generic feature models. We introduce examples
of specific models allowing the actual computation of the terms in
the objective function, and show how this framework leads naturally
to control parameters that have a clear semantic meaning. We
illustrate the properties of our objective functions on synthetic
and real images.

1  Introduction 

All approaches to the problem of extracting features from images
can in principle be phrased in terms of decision theory; however,
the concepts of decision theory are very hard to put into practice
because of the difficulty of evaluating the required probability
measures. Therefore, most practical approaches to model-based
vision for both specific models, e.g., [Binford, 1982, Bolles and
Horaud, 1986, Brooks, 1981, Shneier et al., 1986], and generic
models, e.g., [Fischler et al., 1981, Ohta et al., 1979, McKeown
and Denlinger, 1984, Huertas and Nevatia, 1988], rely on heuristic
measures to select among competing scene parses. These methods,
although they may be effective in the context for which they were
designed, are extremely hard to extend and require the use of many
parameters whose significance is not clearly understood.

On the other hand, approaches such as those of Feldman and
Yakimovsky [1974], Georgeff and Wallace [1984], and Rissanen [1983,
1987] provide a sound theoretical basis for the decision problem
but offer few practical computational methods for dealing with
complex scenes in real images.

In this paper, we focus on an objective function approach to the
task of ranking scene-labeling hypotheses. For brevity, we omit
discussion of the related problem of hypothesis generation, and
refer the reader to [Fua and Hanson, 1989]. We define a class of
objective functions based upon theoretical arguments similar to
those of Georgeff, Wallace and Rissanen, and show that the required
probability estimates can actually be computed in the context of a
few natural assumptions.

This research was supported in part by the Defense Advanced
Research Projects Agency under Contract Nos. MDA903-86-C-0084 and
DACA76-85-C-0004.

Our  formulation  has  many  desirable  features,  but  is 
not  by  itself  a  complete  solution  to  the  feature  extrac¬ 
tion  problem.  To  be  effective  it  must  be  coupled  with  a 
robust  hypothesis  generation  mechanism  and  an  efficient 
optimization  procedure.  Furthermore,  one  would  like  to 
have  models  for  geometric  quality  analysis  much  more 
complex  than  those  presented  here.  It  should  come  as 
no  surprise  that  discovering  good  models  and  hypothesis- 
generation  strategies  are  the  most  difficult  tasks  in  the 
development  of  a  system  attempting  to  perform  shape 
perception.  The  strength  of  our  approach  is  that  it  pro¬ 
vides  a  unified  framework  that  clearly  exposes  the  criti¬ 
cal  components  and  characteristics  of  model-based  vision 
systems. 

2  Derivation  of  the  Objective  Function 

The  goal  of  feature  extraction  is  to  parse  a  scene  in 
terms  of  objects  conforming  to  particular  models.  To 
discriminate  among  competing  parses,  an  objective  func¬ 
tion  must  be  able  to  measure  the  goodness  of  fit  to 
feature  models  that  include  such  characteristics  as  area 
photometry,  edge  photometry,  shape,  and  semantic  re¬ 
lationships.  In  this  section,  we  define  a  basic  class  of 
models,  discuss  the  parameters  we  expect  to  control  our 
objective  functions,  derive  the  theoretical  forms  of  the 
objective  functions  themselves,  and  provide  an  interpre¬ 
tation  of  the  resulting  functions  in  terms  of  information 
theory. 

2.1  Object  Modeling 

For  the  purposes  of  this  work,  we  define  a  model  to  be  a 
geometric  description  of  an  object  in  the  world  charac¬ 
terized  by  its  geometric  constraints  and  its  photometric 
signature;  we  define  the  evidence  for  such  objects  in  dig¬ 
ital  images  to  be  a  collection  of  delineated  areas  corre¬ 
sponding  to  major  object  parts,  together  with  associated 
quantities  directly  derivable  from  the  pixel  values  in  such 
areas. 


We interpret the photometric signature of any object model in terms
of the expected signal from an ideal object model plus a noise
model [Rissanen, 1983, Rissanen, 1987, Leclerc, 1989]. The object's
evidence can then be encoded in terms of these models. We will use
the length of the shortest encoding to measure the quality of the
fit between the data and the model.

2.2  Essential  Parameters  of  the  Objective 
Function 

Our  approach  introduces  two  fundamental  parameters, 
the  scale  and  the  shape  coefficient: 

Scale.  The  scale  is  interpretable  as  the  unavoidable 
dimensional  factor  that  converts  dimensional  quantities 
such  as  area  or  length  into  dimensionless  probabilities. 
Area  units  are  thus  scaled  down  by  two  powers  of  the 
dimensional  unit,  while  length  terms  such  as  edges  are 
scaled  down  by  a  single  power.  The  scale  parameter 
thus  controls  whether  the  area  signature  dominates  edge 
signature. 

The scale parameter may also be understood by observing that when
an image is resampled or zoomed, the area $A$ of a patch will
change, but the complexity of the patch, as reflected in its
minimal encoding, should remain invariant. Thus there should be
some intrinsic zoom factor $s$ that relates the area $A$ to the
area $A_0 = A/s^2$ in the zoomed image that has exactly the
resolution needed to encode the model complexity without
oversampling. The formulas presented later in the paper may thus be
alternatively interpreted as expressing the patch encoding cost in
terms of the sampling-invariant quantity $A_0$ instead of $A$
itself.

Shape  Coefficient.  An  objective  function  with  a 
shape  quality  term  alone  will  retain  any  candidate  model 
instance  with  the  appropriate  geometry,  even  if  it  does 
not  fit  the  image  data.  On  the  other  hand,  an  objec¬ 
tive  function  with  only  a  photometric  model  will  make 
the  same  class  of  errors  as  a  segmentation  algorithm. 
The  shape  coefficient  balances  the  possibly  conflicting 
requirements  of  the  geometry  and  photometry;  the  point 
where  this  balance  lies  must  be  determined  by  the  con¬ 
text  of  the  application. 

The  scale  and  shape  coefficients  characterize  the  fun¬ 
damental  balance  of  influences  that  must  be  semanti¬ 
cally  specified  for  each  application.  Within  a  particular 
model  domain,  it  seems  possible  in  principle  to  estimate 
the  scale  by  using  measures  of  local  complexity.  Our  ap¬ 
proach  to  feature-hypothesis  evaluation  provides  a  clear 
way  to  justify  and  understand  the  essential  role  of  these 
two  parameters  in  feature  extraction,  regardless  of  the 
other  details  of  a  particular  system. 

2.3  The  Probability  of  a  Scene  Parse 

We choose to describe the problem of determining the best image
interpretation as the need to maximize the probability
$P = p(m_0 m_1 \ldots m_n \mid e_1 \ldots e_n)$ that, given the
evidence $E = \{e_i;\ i = 1 \ldots n\}$, parsing the scene in terms
of a particular set of model instances
$M = \{m_i;\ i = 1 \ldots n\}$ and a background $m_0$ is in fact
correct.* Each $m_i$ is taken to be a geometric model instance,
while $e_i$ is the measurable evidence for the object, typically a
collection of associated pixel intensities. Since we are interested
in feature extraction, we do not explicitly represent the
background and collect no evidence for it.

It  is  essentially  impossible  to  evaluate  the  conditional 
probability  P  in  its  most  general  form,  so  we  make  a 
crucial  independence  assumption:  the  probability  of  a 
particular  model  hypothesis  is  influenced  only  by  its  cor¬ 
responding  body  of  evidence  and  the  other  model  in¬ 
stances.  For  example,  in  an  aerial  image,  whether  or 
not  a  patch  of  pixels  can  be  identified  as  a  road  may 
depend  on  its  own  photometry  and  on  the  presence  or 
absence  of  neighboring  houses,  but  not  on  the  particular 
photometry  of  those  houses. 

Formally,  this  assumption  can  be  written  as  follows: 
If $I, J, K$ denote sets of indices referring to model instances
and their corresponding bodies of evidence, we assume that, for all
$I, J, K$ such that $J \cap I = \emptyset$ and
$J \cap K = \emptyset$,

P{mjeK\ot)  =  P{mjtK),  and  V7,J,  P{mj\mjei)  = 

The  assumption  may  break  down  when  one  object’s 
expected  photometry  is  strongly  modified  by  another  ob¬ 
ject,  as  when  a  superstructure  or  a  separate  building  oc¬ 
cludes  or  casts  a  shadow  on  a  roof.  In  practice,  one  can 
partially  compensate  for  such  phenomena  by  discounting 
small  anomalies. 

Combining our assumption with Bayes' rule, it is straightforward to
express the probability of the parse as:

$$P = p(m_0 m_1 \ldots m_n \mid e_1 \ldots e_n) = \prod_{i=1}^{n} \frac{p(e_i \mid m_i)}{p(e_i)} \; p(m_0 m_1 \ldots m_n). \tag{1}$$

This expression clearly separates the contribution of the
photometry, in the evidence-dependent terms, from the abstract
contribution of the geometric and semantic component in
$p(m_0 m_1 \ldots m_n)$ under the stated assumption. We further
expand this term as:

$$p(m_0 m_1 \ldots m_n) = p(m_0 \mid m_1 \ldots m_n)\, p(m_1 \ldots m_n) = P_0\, p(m_1 \ldots m_n), \tag{2}$$

where $p(m_1 \ldots m_n)$ is the probability that these $n$ models
appear in the scene, and $P_0$ is the probability that no other
models appear. Since we do not take the background explicitly into
account in this work, we consider $P_0$ to be constant.
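The step from Bayes' rule to Eq. (1) can be spelled out as follows.
This is only a sketch: it assumes that the independence assumption
lets both the likelihood and the evidence prior factor over the
individual objects, i.e., that
$p(e_1 \ldots e_n \mid m_0 \ldots m_n) = \prod_i p(e_i \mid m_i)$ and
$p(e_1 \ldots e_n) = \prod_i p(e_i)$. These two factorizations are
our reading of the assumption, chosen to agree with the definitions
of $F$ and $G$ in Section 2.4.

```latex
\begin{align*}
p(m_0 \ldots m_n \mid e_1 \ldots e_n)
  &= \frac{p(e_1 \ldots e_n \mid m_0 \ldots m_n)\; p(m_0 \ldots m_n)}
          {p(e_1 \ldots e_n)}
     && \text{(Bayes' rule)} \\
  &= \frac{\prod_{i=1}^{n} p(e_i \mid m_i)}{\prod_{i=1}^{n} p(e_i)}
     \; p(m_0 \ldots m_n)
     && \text{(assumed factorization)} \\
  &= \prod_{i=1}^{n} \frac{p(e_i \mid m_i)}{p(e_i)} \; p(m_0 \ldots m_n).
\end{align*}
```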

2.4  Minimal  Encoding  Length  and  Model 
Effectiveness 

We choose to express the quality of a parse as the (base 2)
logarithm† of Eq. (1). Classical information theory [Shannon, 1948,
Hamming, 1985] leads us to interpret the resulting score $S$ in
terms of encoding length. Dropping the constant term $\log P_0$, we
obtain

$$S = \sum_{i=1}^{n} \log \frac{p(e_i \mid m_i)}{p(e_i)} + \log p(m_1 \ldots m_n) = F - G, \tag{3}$$

where we define

$$F = \sum_{i=1}^{n} F_i = \sum_{i=1}^{n} \{-\log p(e_i) + \log p(e_i \mid m_i)\} \tag{4}$$

$$G = -\log p(m_1 \ldots m_n). \tag{5}$$

Here $F$ is what we call the encoding effectiveness of the set of
models. The first term in $F$ is the number of bits needed to
describe the evidence in the absence of the model, while the second
term gives the number of bits needed to describe the evidence in
terms of the model. The term effectiveness is thus motivated by the
fact that $F$ represents the number of bits saved by representing
the evidence using the model, and the fact that $F$ increases as
the fit improves.

$G$ is the number of bits needed to encode the evidence-free model
representation information, and quantifies the elegance of the
chosen set of model instances as well as their dependencies.

*For example, in terms of a human analyst's perception, or in terms
of ground truth.

†All logarithms in this paper are base 2 logarithms.
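The bookkeeping behind Eqs. (3)-(5) can be illustrated with a minimal
Python sketch. It assumes the probabilities are already available as
plain numbers; the function names `effectiveness_bits` and
`parse_score` are hypothetical and do not come from the paper.

```python
import math


def effectiveness_bits(p_evidence, p_evidence_given_model):
    """F_i of Eq. (4): bits saved by describing e_i with model m_i."""
    return -math.log2(p_evidence) + math.log2(p_evidence_given_model)


def parse_score(evidence_probs, model_prior):
    """S = F - G of Eq. (3).

    evidence_probs: list of (p(e_i), p(e_i | m_i)) pairs (assumed given).
    model_prior:    joint prior p(m_1 ... m_n) of the model instances.
    """
    F = sum(effectiveness_bits(pe, pem) for pe, pem in evidence_probs)
    G = -math.log2(model_prior)  # Eq. (5): bits needed to encode the models
    return F - G
```

For instance, evidence that costs 8 bits without the model and 4 bits
with it contributes 4 bits of effectiveness; the parse is worthwhile
only if such savings exceed the model cost $G$.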

2.5  Remarks 

Feature  Extraction  Viewed  as  an  Optimization 
Problem.  The  problem  of  finding  the  best  parse  of  a 
scene  can  now  be  rephrased  as  the  problem  of  optimiz¬ 
ing  over  sets  of  hypotheses  evaluated  by  Eq.  (3).  Global 
optimization  corresponds  to  a  blind  search  procedure, 
which  searches  all  possibilities  without  attempting  to 
determine  which  candidates  are  more  likely  than  oth¬ 
ers.  In  practice,  the  search  space  may  be  far  too  large 
for  this  type  of  search.  Since  intelligent  heuristics  can 
overcome  this  drawback,  a  natural  way  to  design  an  ap¬ 
plication  system  is  to  incorporate  hypothesis-generation 
algorithms  that  project  from  the  space  of  all  possible  hy¬ 
potheses  onto  a  subspace  of  very  likely  hypotheses.  Such 
projections  have  the  side  effect  of  reducing  the  discrimi¬ 
natory  burden  placed  upon  the  objective  function. 

Generic  Models  Require  Photometric/Geometric 
Balance.  When  a  model’s  geometry  is  completely  de¬ 
termined  beforehand,  as  it  is  for  template-matching  ap¬ 
proaches  to  automatic  shape  recognition,  there  is  no  need 
for  the  geometric  information  component  of  the  objec¬ 
tive  function,  since  it  is  constant  and  maximum  like¬ 
lihood  analysis  alone  will  do.  The  geometric  terms  in 
the  objective  function  begin  to  play  a  critical  role  when 
we  utilize  models  defined  by  a  set  of  general  geometric 
constraints  in  place  of  a  specific  shape  template.  Such 
generic  models,  with  arbitrarily  large  numbers  of  param¬ 
eters,  require  objective  functions  like  ours  that  balance 
their  geometric  aspects  against  their  photometry. 

3  Photometry:  Computing  F 

Two  of  the  main  characteristics  of  an  object  in  an  im¬ 
age  are  its  interior  photometry  and  its  contrast  with  the 
background,  which  produces  edges.  Here  we  explore  sim¬ 
ple  models  for  the  area  and  for  the  edges  of  an  object 
that  have  proven  useful  in  analyzing  imagery.  When 
working  with  stereo  pairs  of  images,  we  also  incorporate 
a  stereoscopic  model,  and  compute  the  depth  parameters 
of  an  object  in  the  scene  by  optimizing  the  corresponding 
stereo  effectiveness. 


We have seen that the effectiveness $F$ is computed as
$-\log p(e) + \log p(e \mid m)$, where $e$ represents the grey-level
values of the pixels that are enclosed by the contour $m$. For the
sake of exposition, let us distinguish the evidence $e_A$ relative
to the interior of the patch and the evidence $e_E$ relative to the
boundary. Formally, we can write:

$$p(e \mid m) = p(e_A \mid m)\, p(e_E \mid m, e_A)$$

$$p(e) = p(e_A)\, p(e_E \mid e_A).$$

We assume that contrast with the background can be measured by
using local image derivatives, while ignoring the grey levels of
the boundary pixels. This contrast depends on the grey level of
background pixels that do not appear in the object descriptions,
and can therefore be considered as independent of the interior
object photometry. Thus we write $F_i$ in Eq. (4) as the sum of
area and edge components:

$$F_i = F_{i,A} + F_{i,E}$$

$$F_{i,A} = -\log p(e_A) + \log p(e_A \mid m)$$

$$F_{i,E} = -\log p(e_E) + \log p(e_E \mid m).$$

This  prescription  must  be  modified  when  dealing  with 
objects  that  share  edges,  since  the  contrast  of  the  shared 
edges  is  completely  determined  by  the  photometry  of  the 
regions  on  both  sides  of  the  edge.  In  this  case,  the  shared 
boundaries  do  not  contribute  to  the  edge  effectiveness 
term. 

When additional images are available and $m$ is a
three-dimensional model, additional evidence $e_S$ can be gathered
using the projection of $m$ onto each image. We write:

$$p(e, e_S \mid m) = p(e \mid m)\, p(e_S \mid m, e)$$

$$p(e, e_S) = p(e)\, p(e_S \mid e).$$

In the case of a pair of stereo images, $e$ is the evidence
measured in the left image and $e_S$ the corresponding evidence in
the right image relative to the model projected into that image.
For a stereo pair, we therefore add to the effectiveness a stereo
effectiveness term,

$$F_S = -\log p(e_S \mid e) + \log p(e_S \mid m, e). \tag{6}$$

3.1  Area  Model  for  Homogeneous  Regions 

We  model  the  interior  intensities  of  an  image  region  by  a 
smooth  intensity  surface  with  a  Gaussian  distribution  of 
deviations  from  the  surface.  Since  objects  in  real  images 
typically  have  anomalies  which  do  not  lie  on  the  smooth 
surface,  we  encode  such  anomalous  pixels  as  outliers.  As 
we  shall  see  later,  this  can  critically  enhance  the  discrim¬ 
inatory  power  of  the  area-encoding  effectiveness. 

In the application of our approach to aerial imagery, we take the
intensity surface to be a plane. In Figure 1, we show: (a) an image
and a delineated model instance; (b) the histogram of deviations
from the planar fit to the intensity surface; (c) the solid white
area indicating the location of the pixels within the main Gaussian
peak. Black areas within the model outline lie outside the peak and
are considered anomalous.

Figure 1: (a) Delineated model instance, (b) Histogram of
deviations, (c) Anomalies.

In an 8-bit image, it would take $8A$ bits to encode the pixel
values if we did not take advantage of dependencies among pixels.
Similarly, it would take $k_A A$ bits to encode the same
information using our region model, where

$$k_A A = n(\log \sigma + c) + 8\bar{n} + E(n, \bar{n}). \tag{7}$$

Here $n(\log \sigma + c)$ is the cost of Huffman-encoding [Hamming,
1985] the pixels in the Gaussian peak, $8\bar{n}$ is the cost of
encoding the outliers, and

$$E(n, \bar{n}) = -\left[ n \log \frac{n}{A} + \bar{n} \log \frac{\bar{n}}{A} \right] \tag{8}$$

is the entropy, i.e., the cost of specifying whether a pixel is or
is not anomalous; $\sigma$ is the standard deviation of the
Gaussian distribution, $n$ is the number of pixels in the Gaussian
peak, $\bar{n} = A - n$, and $c = (1/2) \log(2\pi e)$. Note that in
the computation of the encoding cost, we have not included the cost
of encoding the six internal parameters of the model: 3 for the
plane, 2 for the Gaussian, and one for the probability $n/A$ that a
pixel lies in the main peak. It can be shown [Rissanen, 1983,
Schwarz, 1978] that these costs are approximately equal to
$(1/2) \log A$ bits per internal parameter of the statistical
distribution, and are therefore negligibly small compared to
$k_A A$.
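The encoding cost of Eqs. (7) and (8) can be sketched in Python as
follows. The sketch assumes that the inlier count $n$, the outlier
count $\bar{n}$, and the standard deviation $\sigma$ of the residuals
in the main peak have already been measured; how the peak is
delimited is not specified here. The function names are hypothetical.

```python
import math

# c = (1/2) log2(2 * pi * e), the constant of Eq. (7), in bits
C = 0.5 * math.log2(2 * math.pi * math.e)


def entropy_bits(n_in, n_out):
    """E(n, n_bar) of Eq. (8): bits to flag every pixel as inlier or outlier."""
    total = n_in + n_out
    bits = 0.0
    for k in (n_in, n_out):
        if k > 0:  # the term 0 * log(0) is taken to be 0
            bits -= k * math.log2(k / total)
    return bits


def area_encoding_cost(n_in, n_out, sigma, bits_per_pixel=8):
    """k_A * A of Eq. (7): bits to encode the patch with the region model."""
    return (n_in * (math.log2(sigma) + C)
            + bits_per_pixel * n_out
            + entropy_bits(n_in, n_out))


def area_bits_saved(n_in, n_out, sigma, bits_per_pixel=8):
    """Raw cost (8 bits per pixel) minus model cost, before any scale weighting."""
    raw = bits_per_pixel * (n_in + n_out)
    return raw - area_encoding_cost(n_in, n_out, sigma, bits_per_pixel)
```

With the outlier cost fixed at 8 bits per pixel, the saving reduces
to $(8 - c - \log\sigma)\,n - E(n, \bar{n})$, so a tighter Gaussian
peak or fewer anomalies both increase it.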

We weight all areas and lengths using the scale parameter $s$ (see
Section 2.2) so that the area-encoding effectiveness becomes:

$$F_{i,A} = \text{bits(without model)} - \text{bits(with model)} = \frac{1}{s^2}\left((8 - c - \log\sigma)\,n - E(n, \bar{n})\right). \tag{9}$$

Optimization of this score is intuitively appropriate because it
finds the best compromise among the following:

•  large area $A$,

•  low standard deviation $\sigma$,

•  small number of anomalies $\bar{n}$.

Effect of Anomaly Discounting. In the graphs on the left in Figure
2, we plot the area-encoding effectiveness $F_A$ as a function of
the radius of a square patch centered at the center of the images
shown in the left column: a good but noisy synthetic image of a
square, the same image with gross area anomalies, and an image of a
similar but distorted square. When we compare the results obtained
after discounting anomalies (solid lines) with those found without
anomaly discounting (dotted lines), we see that anomaly discounting
must be included to make the objective function reliably select the
same shape a human observer perceives. This is potentially a
critical factor in the practical application of this approach
because, as we see in Figure 1, real images nearly always have
significant anomalous components.

Figure 2: Area and edge effectiveness of a square patch as a
function of candidate radius, with (solid) and without (dotted)
anomaly discounting.


Note that we only have local maxima of the area-encoding
effectiveness appearing in Figure 2; for large radii, a better
parse of the scene would be in terms of two model hypotheses, one
square and one square-shaped ring covering the rest of the image,
rather than one square plus random background. From this example,
we see that a high score alone is not an adequate criterion; we
must also require local maximality when dealing with a partial
description of the scene as opposed to a global one. For this
reason it is important in practice to measure whether a candidate
object passes this maximality test. Experimentally, we have found
that high edge quality enforces this requirement; we now turn to
the explicit form of the edge term used.


3.2  Edge  Model 

We adopt the definition [Rosenfeld, 1970, Haralick, 1984, Canny,
1986] of edge pixels as maxima of the local image derivative, and
we classify edges according to whether or not an edge boundary
pixel conforms to this definition. In the absence of a model, it
would take 1 bit per pixel to encode this information. If we now
use the 1-parameter model that takes into account the proportion of
maximal edge pixels, the most efficient Huffman [Hamming, 1985]
code for this information would require

$$k_E = -\left[ \frac{n}{L} \log \frac{n}{L} + \frac{\bar{n}}{L} \log \frac{\bar{n}}{L} \right] \tag{10}$$

bits per boundary pixel, where $L$ is the length of the patch
boundary in pixels, $n$ is the number of boundary pixels that are
maxima of the local image gradient, and $\bar{n} = L - n$.

We then weight all lengths by the scale factor $s$ and estimate the
edge-encoding effectiveness to be

$$F_{i,E} = \text{bits(without model)} - \text{bits(with model)} = (1 - k_E)\,\frac{L}{s}. \tag{11}$$

As in the case of the area term, we have neglected the
$(1/2) \log(L/s)$ bits required to encode the one internal
parameter of the model [Rissanen, 1983, Schwarz, 1978].

As  shown  in  the  right  column  of  Figure  2,  this  edge 
score  is  maximal  when  all  boundary  pixels  conform  to 
our  edge  model,  and  degrades  as  the  proportion  of  such 
pixels  diminishes.  This  model  has  proven  effective  in  our 
application  of  these  techniques  to  aerial  images  because 
it  provides  a  measure  of  edge-quality  that  does  not  in¬ 
clude  an  image-dependent  threshold  on  edge  strength. 

We  have  also  experimented  with  an  edge  model  that 
requires  the  gradient  direction  be  normal  to  the  object 
outline,  and  computes  the  encoding  cost  of  deviations 
from  the  normal  vector.  Both  models  yield  similar  rank¬ 
ings. 
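The edge term of Eqs. (10) and (11) is simple enough to sketch
directly. The following Python fragment assumes that the number of
boundary pixels that are local gradient maxima has already been
counted; the function names are hypothetical.

```python
import math


def edge_cost_per_pixel(n_edge, length):
    """k_E of Eq. (10): entropy, in bits per boundary pixel, of the
    'is this pixel a gradient maximum?' flag."""
    bits = 0.0
    for k in (n_edge, length - n_edge):
        if k > 0:  # the term 0 * log(0) is taken to be 0
            frac = k / length
            bits -= frac * math.log2(frac)
    return bits


def edge_effectiveness(n_edge, length, scale):
    """F_{i,E} of Eq. (11): (1 - k_E) * L / s."""
    return (1.0 - edge_cost_per_pixel(n_edge, length)) * length / scale
```

A boundary on which every pixel is a gradient maximum costs nothing
to encode and receives the full score $L/s$, while a boundary on
which only half of the pixels conform saves nothing. As written, the
entropy is symmetric in $n$ and $\bar{n}$, so a practical
implementation may want to treat boundaries with very few conforming
pixels separately.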

3.3  Stereography 

The  simplest  stereo  model  assumes  that  corresponding 
pixels  have  the  same  grey-levels  in  both  images.  In  prac¬ 
tice,  to  compute  the  stereo  effectiveness  of  Eq.  (6),  we 
determine  the  number  of  bits  required  to  encode  the  pro¬ 
jected  patch  in  the  second  image,  while  knowing  its  pho¬ 
tometry  in  the  first.  We  compute  the  deviations  of  the 
intensities  from  their  predicted  values  and  encode  them 
using  the  same  Gaussian  model  with  anomalies  that  we 
used  for  the  area  term.  The  anomaly  discounting  is  re¬ 
quired  because  of  the  possibility  of  occlusions.  We  also 
take  into  account  the  edge  quality  of  the  contour  in  the 
second  image  and  its  edge-encoding  effectiveness. 

The stereographic effectiveness term $F_S$ is therefore the sum of
an edge and an area term:

$$F_S = F_{AS} + F_{ES} \tag{12}$$

$$F_{AS} = (8 - k_{A_2})\,\frac{A_2}{s^2}$$

$$F_{ES} = (1 - k_{E_2})\,\frac{L_2}{s},$$

where $A_2$ is the area of the projected patch in the second image,
$L_2$ is its boundary length, and $k_{A_2}$ and $k_{E_2}$ are the
corresponding model encoding costs.

We  can  use  the  effectiveness  measure  (12)  to  opti¬ 
mize  the  elevation  parameters  of  a  two-dimensional  de¬ 
lineation  found  in  the  first  image.  The  search  space  is  ex¬ 
tremely  constrained  since  the  projected  shape  is  known 
and  the  only  degree  of  freedom  is  epipolar  motion  in  the 
second  image. 

Let us consider the stereo pair of images in Figure 3(a,c).
Assuming that the roof is horizontal, we plot in Figure 3(b) the
value of $F_S$ as a function of the assumed disparity between the
candidate outline in the left image (a) and the projected outline
in the right image. We note that $F_S$ has a sharp peak for the
correct match outlined in (c).

Figure 3: (a) Roof candidate in left image of a stereo pair, (b)
$F_S$ as a function of the assumed disparity between left and right
image, (c) The projection of the contour in the right image using
the best disparity value.


4  Geometry:  Computing  G

The geometric cost $G$ defined by Eq. (5) is a measure of quality
of a set of object hypotheses. The simplest way to handle
dependencies among objects is to require that there be no conflicts
within a particular set of hypotheses; formally we write:

$$p(m_i \mid m_j) = \begin{cases} p(m_i) & \text{if } m_i \cap m_j = \emptyset \text{ or } m_i \subset m_j \\ 0 & \text{otherwise} \end{cases}$$

$$p(m_1 \ldots m_n) = \begin{cases} \prod_i p(m_i) & \text{if no conflict} \\ 0 & \text{otherwise.} \end{cases}$$

It follows that $G$ can be expressed as

$$G = -\log p(m_1 \ldots m_n) = \gamma \sum_{i=1}^{n} G_i, \tag{13}$$

where $G_i \propto -\log p(m_i)$ is a model quality measure that
increases as the shape degrades, and $\gamma$ is the arbitrary
shape coefficient.

Now we can deduce a mechanism for deciding whether the addition of
one more feature object is advantageous or detrimental to the
overall parse. If we write the overall score in the form

$$S = \sum_{i=1}^{n} (F_i - \gamma G_i),$$

we conclude that we should accept only model instances with
$(F_i - \gamma G_i) > 0$, since these are the only ones that
improve the likelihood of the full scene parse.

The simplest effective model for $G_i$ is the sum of the cost of
chain-encoding the boundary of the object's area plus a constant
cost for introducing a new object; this gives a geometric cost

$$G_i = c + \frac{L}{s}. \tag{14}$$
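The acceptance rule above can be sketched in a few lines of Python.
The sketch assumes that Eq. (14) has the form $G_i = c + L/s$, with
$L$ the boundary length in pixels and $c$ a constant cost per object;
the function and parameter names are hypothetical.

```python
def geometric_cost(boundary_length, scale, new_object_cost):
    """G_i, assumed here to be c + L / s: a constant cost for
    introducing an object plus its scaled boundary length."""
    return new_object_cost + boundary_length / scale


def accept(effectiveness, geometric, gamma=1.0):
    """Keep a model instance only if F_i - gamma * G_i > 0."""
    return effectiveness - gamma * geometric > 0
```

For example, a candidate with $F_i = 20$ bits and a boundary of 40
pixels at scale 4 with $c = 2$ has $G_i = 12$; it is accepted for
$\gamma = 1$ but rejected for $\gamma = 2$, which shows how the shape
coefficient shifts the balance toward geometry.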




Figure  4:  (a)  Ratio  of  single-square  to  double-rectangle 
score  as  a  function  of  noise  variance  (40,  20,  10).  (b) 
Similar  plot  comparing  the  score  of  the  square  interpre¬ 
tation  to  the  “U”  interpretation. 

In Figure 4(a), we show how the length term (14), which gives
preference to compact objects, influences the parse when a split
square is interpreted alternately as a single compact square or two
adjacent rectangles. The bottom graph takes three images, with
noise variance 40, 20 and 10, and plots the ratios (two-rectangle
score)/(square score) as a function of scale for fixed
$\gamma = 1$. Note that increasing the scale in this example
amounts to looking at a subsampled image in which fine details are
no longer visible. The interesting value of the scale is that for
which the scores are equal, i.e., the ratio is one. Thus we plot in
the upper graphs the locus of points where the ratio is unity as a
function of $\gamma$ as well as scale. In Figure 4(b), we carry out
a similar plot for an image of a square with a missing portion that
makes it "U"-shaped. We see that the ratio ("U" score)/(square
score) behaves so that the square interpretation is preferred at a
large scale in the best image, and at a much lower scale in the
noisier images.

5  Examples

We have applied the principle of objective-function optimization to
operator-initiated shape extraction and to automated extraction of
generic cartographic features such as buildings from aerial
imagery, described elsewhere [Fua and Hanson, 1989]. In the
automated application, we use a hypothesis generator that carries
out the following steps: (1) extract linked edges; (2) find edges
obeying geometric constraints (such as rectilinearity) that define
enclosed regions in the image; (3) compute the score of each
enclosed area using the objective function; (4) find the subset of
nonconflicting shape candidates maximizing the total score. One may
also optimize each candidate shape with respect to the objective
function before the final ranking.
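Step (4) can be illustrated with a small exhaustive search. This is
only a sketch under stated assumptions: each candidate is represented
as a set of pixel indices with a precomputed score, two candidates
are taken to be compatible when they are disjoint or nested (the
no-conflict condition of Section 4, applied symmetrically), and the
search is brute force, which is practical only for a handful of
candidates. The names `compatible` and `best_parse` are hypothetical,
and the search strategy is not the one used in the system described
here.

```python
from itertools import combinations


def compatible(a, b):
    """Regions do not conflict if they are disjoint or one contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a


def best_parse(candidates):
    """Return (indices, total) of the conflict-free subset with highest score.

    candidates: list of (pixel_set, score) pairs.
    Exhaustive search over all subsets, so exponential in len(candidates).
    """
    best_subset, best_total = (), 0.0
    for size in range(1, len(candidates) + 1):
        for subset in combinations(range(len(candidates)), size):
            if all(compatible(candidates[i][0], candidates[j][0])
                   for i, j in combinations(subset, 2)):
                total = sum(candidates[i][1] for i in subset)
                if total > best_total:
                    best_subset, best_total = subset, total
    return list(best_subset), best_total
```

Because the empty parse scores zero, a candidate with a negative
score is never selected, which matches the rule of keeping only
instances that improve the overall parse.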

The  objective  function  plays  a  crucial  role  in  this  ap¬ 
plication  because  the  hypothesis  generator  will  always 
produce  conflicting  sets  of  candidates,  and  a  means  of 
distinguishing  among  these  is  absolutely  essential. 


Figure  5:  (a)  A  complex  building,  (b)  Interpretation  in 
terms  of  a  single  polygon,  (c)  Interpretation  in  terms  of 
two  polygons. 


For example, for Figure 5(a), the system produces two conflicting
interpretations: one in terms of a single polygon enclosing both
wings, as in Figure 5(b), the other in terms of two polygons, one
for each wing, as in Figure 5(c). At low scale the latter will be
preferred because of its better fit to the photometric data, while
at high scale the former will dominate due to its lower geometric
cost.

In  Figure  6,  we  show  the  hypotheses  generated  and 
retained  by  the  system  for  scale  values  of  6,  7  and  8, 
with  fixed  shape  coefficient;  for  this  scene,  scale  8  clearly 
gives  the  best  parse. 

From the examples shown in this section, we can form an intuitive
understanding of the scale parameter: $s$ tunes the scale not of
the physical size of the object, but the scale of its quality.
Objects with close fits to the strict model are selected first as
we ramp the scale down from a high value.

6  Conclusion 

In this work, we have shown how an information-theoretic approach
to the feature extraction problem can be formulated in such a way
as to permit realistic computational techniques for the required
probability estimates. Our approach provides a firm theoretical
basis for understanding complex feature extraction problems that
require a balance between photometric evidence and geometric
quality. Of course, the objective function approach given here
cannot by itself lead to good solutions to the feature extraction
problem, but must be teamed with a competent (human or automated)
hypothesis generator [Fua and Hanson, 1989]. Among the goals of
future work will be the extension of the range of our models and
the treatment of complex semantic dependencies in terms of their
information-theoretic context.

Figure 6: Aerial image of suburban buildings parsed at scales 6, 7,
and 8.